Data Storage Systems - System Design Guide
A comprehensive guide to data storage and processing systems from a system design interview perspective.
Table of Contents
- Core Concepts Overview
- Storage Systems
- Processing Engines
- System Design Considerations
- Modern Data Architecture Patterns
- Interview Checklist
- Common Interview Scenarios
1. Core Concepts Overview
ETL (Extract → Transform → Load)
Definition: Data pipeline pattern where data is extracted from sources, transformed (cleaned, aggregated, enriched), and then loaded into the target system.
Purpose:
- Complex business logic transformations before storage
- Data quality enforcement at ingestion
- Schema validation and normalization
- Data enrichment and integration from multiple sources
When to Use:
- Transformations are computationally intensive
- Need to reduce data volume before loading
- Target system has limited compute capacity
- Compliance requires data cleansing before storage
- Legacy systems with limited query capabilities
Typical Tools: Talend, Informatica, AWS Glue, Apache Airflow + Spark, Azure Data Factory
Example Flow:
MySQL DB → Extract → Spark Transformation → Load → Snowflake
Logs (S3) → Extract → Filter/Aggregate → Load → Redshift
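As a concrete illustration of the transform-before-load step, here is a minimal PySpark sketch of the first flow above. The JDBC connection details, table names, and Snowflake Spark connector options are illustrative assumptions, not a prescribed setup.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-orders").getOrCreate()

# Extract: read the source table over JDBC (connection details are illustrative)
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://mysql-host:3306/shop")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "***")
          .load())

# Transform: cleanse and aggregate before anything is persisted downstream
daily_revenue = (orders
                 .filter(F.col("status") == "COMPLETED")
                 .withColumn("order_date", F.to_date("created_at"))
                 .groupBy("order_date")
                 .agg(F.sum("amount").alias("revenue"),
                      F.countDistinct("customer_id").alias("customers")))

# Load: write the already-transformed result to the warehouse
# (exact format/options depend on the Snowflake Spark connector in use)
(daily_revenue.write
 .format("snowflake")
 .options(sfURL="account.snowflakecomputing.com", sfUser="etl_user",
          sfPassword="***", sfDatabase="ANALYTICS", sfSchema="PUBLIC",
          sfWarehouse="ETL_WH", dbtable="DAILY_REVENUE")
 .mode("overwrite")
 .save())
```

Only the aggregated result ever reaches the warehouse, which is exactly the storage/flexibility trade-off listed below.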
Trade-offs:
- ✅ Reduced storage costs (only transformed data stored)
- ✅ Better performance for downstream queries
- ✅ Data quality guaranteed before loading
- ❌ Longer pipeline latency
- ❌ Less flexibility for reprocessing
- ❌ Schema changes require pipeline updates
ELT (Extract → Load → Transform)
Definition: Modern pattern where raw data is loaded first, then transformations run inside the data warehouse/lake.
Purpose:
- Leverage powerful warehouse compute engines
- Preserve raw data for flexibility
- Enable faster ingestion
- Support schema evolution
When to Use:
- Using modern cloud warehouses (Snowflake, BigQuery)
- Need to preserve raw data
- Transformations might change frequently
- Multiple teams need different views of same data
- Real-time or near-real-time requirements
Advantages:
- Schema-on-read flexibility
- Raw data always available for reprocessing
- Faster initial ingestion
- Easier debugging (raw data visible)
- Better support for iterative analytics
Example Flow:
API Data → Load to S3 → Load to Snowflake → Transform with dbt
Kafka Stream → Load to BigQuery → Transform with SQL
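For comparison, a minimal Python sketch of the first flow above, assuming boto3 and the Snowflake Python connector are installed; the bucket, stage, credentials, and the `raw` VARIANT column on the staging table are illustrative assumptions.

```python
import boto3
import snowflake.connector

# Load: land raw API payloads in object storage first, untransformed
s3 = boto3.client("s3")
s3.upload_file("events_2024-01-15.json", "raw-data-bucket", "events/2024-01-15.json")

# Transform: run SQL inside the warehouse after the raw data is loaded
# (assumes an external stage RAW_STAGE pointing at the bucket already exists,
#  and that raw_events has a single VARIANT column named "raw")
conn = snowflake.connector.connect(
    account="my_account", user="elt_user", password="***",
    warehouse="TRANSFORM_WH", database="ANALYTICS", schema="PUBLIC")
cur = conn.cursor()
cur.execute("COPY INTO raw_events FROM @RAW_STAGE/events/ FILE_FORMAT = (TYPE = JSON)")
cur.execute("""
    CREATE OR REPLACE TABLE daily_events AS
    SELECT raw:user_id::string        AS user_id,
           DATE(raw:ts::timestamp)    AS event_date,
           COUNT(*)                   AS events
    FROM raw_events
    GROUP BY 1, 2
""")
conn.close()
```

The raw table stays available, so the SQL transform can be rewritten and rerun without re-ingesting anything.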
Trade-offs:
- ✅ Raw data preservation
- ✅ Faster time-to-insights
- ✅ More flexible reprocessing
- ❌ Higher storage costs
- ❌ Requires powerful warehouse compute
- ❌ Data quality issues propagate downstream
ETL vs ELT Comparison
| Aspect | ETL | ELT |
|---|---|---|
| Transform Location | Before load (external) | After load (in-warehouse) |
| Compute | Separate processing cluster | Warehouse compute |
| Data Volume | Reduced before storage | Full raw data stored |
| Schema | Fixed upfront | Flexible, schema-on-read |
| Latency | Higher (transform first) | Lower (load raw quickly) |
| Cost | Lower storage, higher compute | Higher storage, leverage warehouse |
| Use Case | Legacy systems, complex pre-processing | Modern cloud warehouses |
| Best For | Teradata, Oracle | Snowflake, BigQuery, Databricks |
2. Storage Systems
Data Warehouse
Definition: Structured, schema-on-write system optimized for analytics (OLAP) workloads.
Purpose:
- Centralized repository for structured business data
- Optimized for complex analytical queries
- Historical data analysis
- Business intelligence and reporting
Architecture Characteristics:
- Columnar storage: Optimized for aggregate queries
- Append-only: Immutable historical records
- Optimized joins: Efficient cross-table analytics
- Materialized views: Pre-computed aggregations
- Query optimization: Cost-based optimizers
Data Modeling:
- Star Schema: Fact table + dimension tables
- Snowflake Schema: Normalized dimension tables
- Fact Tables: Metrics, measurements (sales, clicks)
- Dimension Tables: Descriptive attributes (users, products, time)
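A hedged PySpark sketch of a typical star-schema query: join the fact table to its dimensions, then aggregate. Table and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("star-schema-query").getOrCreate()

# Illustrative star schema: one fact table plus two dimension tables
fact_sales  = spark.table("fact_sales")    # grain: one row per order line
dim_product = spark.table("dim_product")   # descriptive product attributes
dim_date    = spark.table("dim_date")      # calendar attributes

# Typical OLAP query: join the fact to its dimensions, then aggregate
revenue_by_category = (fact_sales
    .join(dim_product, "product_key")
    .join(dim_date, "date_key")
    .groupBy("category", "year", "month")
    .agg(F.sum("sales_amount").alias("revenue")))

revenue_by_category.show()
```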
Examples:
- Snowflake (cloud-native)
- Amazon Redshift (AWS)
- Google BigQuery (GCP)
- Azure Synapse Analytics
- Teradata (on-premise/legacy)
When to Use:
- Structured business analytics
- BI dashboards and reports
- SQL-heavy workloads
- Historical trend analysis
- Regulatory reporting
Trade-offs:
- ✅ Excellent query performance for analytics
- ✅ SQL interface familiar to analysts
- ✅ Strong consistency guarantees
- ✅ Optimized for aggregations and joins
- ❌ Schema changes are costly
- ❌ Not ideal for unstructured data
- ❌ Higher cost per GB than data lakes
- ❌ Not designed for real-time updates
Data Lake
Definition: Repository that stores raw data in native format - structured, semi-structured, and unstructured.
Purpose:
- Store massive volumes of raw data cheaply
- Support diverse data types and formats
- Enable exploratory analytics and ML
- Serve as single source of truth
Schema Approach:
- Schema-on-read: Structure applied at query time
- Format flexibility: JSON, CSV, Parquet, Avro, logs, images, videos
- No upfront modeling: Store first, structure later
Examples:
- AWS S3 + Athena/Glue
- Azure Data Lake Storage (ADLS)
- Google Cloud Storage + BigQuery
- Hadoop HDFS (on-premise)
Common Formats:
- Parquet: Columnar, compressed, great for analytics
- Avro: Row-based, schema evolution support
- ORC: Optimized columnar format for Hive
- JSON: Flexible but inefficient for large scale
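A short PySpark sketch of the common "raw JSON in, partitioned Parquet out" conversion, assuming the job runs with S3 access configured; paths and column names are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# Read raw JSON events from the lake's landing zone (path is a placeholder)
events = spark.read.json("s3://my-lake/raw/events/")

# Derive partition columns so downstream queries can prune by date
events = (events
          .withColumn("event_date", F.to_date("ts"))
          .withColumn("year",  F.year("event_date"))
          .withColumn("month", F.month("event_date"))
          .withColumn("day",   F.dayofmonth("event_date")))

# Write columnar, compressed Parquet partitioned by date
(events.write
 .mode("append")
 .partitionBy("year", "month", "day")
 .option("compression", "snappy")
 .parquet("s3://my-lake/curated/events/"))
```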
When to Use:
- Storing raw logs, events, clickstreams
- IoT sensor data
- Machine learning training data
- Unstructured data (images, videos, documents)
- Data science exploration
- Cost-effective long-term storage
Challenges:
- Data governance and cataloging
- Data quality consistency
- Query performance can be poor
- Difficult to maintain ACID properties
- "Data swamp" risk without proper management
Trade-offs:
- ✅ Very low storage cost
- ✅ Handles any data type
- ✅ Massive scalability
- ✅ Great for ML and data science
- ❌ No ACID guarantees (traditionally)
- ❌ Poor query performance without optimization
- ❌ Governance challenges
- ❌ Requires data catalog tools
Lakehouse Architecture
Definition: Hybrid approach combining data lake flexibility with data warehouse performance and reliability.
Purpose:
- Single platform for all data workloads
- ACID transactions on data lake storage
- Schema enforcement with flexibility
- Support both BI and ML workloads
Key Technologies:
- Delta Lake (Databricks)
- Apache Iceberg (Netflix/Apple)
- Apache Hudi (Uber)
Features:
- ACID transactions on object storage
- Time travel and versioning
- Schema enforcement and evolution
- Unified batch and streaming
- Z-ordering and data skipping
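A minimal sketch of two of these features (ACID MERGE and time travel) using the Delta Lake Python API, assuming a Spark session configured with the delta-spark package; paths and keys are illustrative.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

path = "s3://my-lake/silver/users"          # placeholder table location
updates = spark.read.parquet("s3://my-lake/bronze/user_updates/")

# ACID upsert (MERGE) directly on object storage
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.user_id = u.user_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: read the table as it was at an earlier version
users_v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```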
Medallion Architecture Layers: Bronze (raw ingested data) → Silver (cleaned, validated data) → Gold (aggregated, business-ready data); see the Medallion Architecture pattern in Section 5 for details.
When to Use:
- Need both data lake economics and warehouse performance
- Supporting diverse teams (data scientists + analysts)
- Want to eliminate data silos
- Need strong consistency with S3-like costs
Trade-offs:
- ✅ Best of both worlds (lake + warehouse)
- ✅ Lower cost than pure warehouse
- ✅ ACID guarantees on cheap storage
- ✅ Eliminates data duplication
- ❌ More complex to set up
- ❌ Newer technology (less mature)
- ❌ Requires learning new concepts
3. Processing Engines
Apache Spark
Definition: Distributed compute engine for large-scale data processing (batch and streaming).
Purpose:
- Process massive datasets in parallel
- Complex transformations and aggregations
- Machine learning pipelines
- Unified batch and streaming
Architecture Components:
- Driver: Coordinates work, builds execution plan
- Executors: Perform computations on worker nodes
- Cluster Manager: Resource allocation (YARN, Kubernetes, Mesos)
APIs:
- Spark SQL (DataFrames) - Structured data processing
- RDD (Resilient Distributed Datasets) - Low-level API
- Structured Streaming - Stream processing
- MLlib - Machine learning library
- GraphX - Graph processing
Processing Model:
- The driver builds a DAG of transformations; execution is lazy until an action (write, collect, count) triggers it
- The DAG is split into stages and tasks that executors run in parallel over data partitions
- Intermediate results can be cached in memory to speed up iterative workloads
When to Use:
- ETL pipelines with complex logic
- Large-scale data transformations (TB/PB scale)
- Machine learning at scale
- Batch processing of historical data
- Stream processing with Structured Streaming (second-level latency acceptable)
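To illustrate the last point, a minimal Structured Streaming sketch that aggregates clicks from Kafka in one-minute event-time windows. It assumes the Spark Kafka integration package is on the classpath; broker, topic, and checkpoint paths are placeholders, and the console sink stands in for a real one.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream-agg").getOrCreate()

# Read a Kafka topic as an unbounded DataFrame (broker/topic names are placeholders)
clicks = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load()
          .selectExpr("CAST(value AS STRING) AS json", "timestamp"))

# Micro-batch aggregation: clicks per 1-minute window, with a lateness watermark
counts = (clicks
          .withWatermark("timestamp", "5 minutes")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

# Write incremental results; the console sink is for illustration only
query = (counts.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/checkpoints/clicks")
         .start())
query.awaitTermination()
```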
Common Deployment Pattern:
- Spark on YARN, Kubernetes, or a managed service (AWS Glue, EMR, Databricks)
- Reads raw data from object storage (S3/ADLS) and writes curated output to a warehouse or lakehouse tables
- Jobs scheduled and monitored by an orchestrator such as Airflow
Trade-offs:
- ✅ Massive scalability (handles PB-scale data)
- ✅ Rich transformation APIs (SQL, DataFrame, RDD)
- ✅ Supports multiple languages (Scala, Python, Java, R)
- ✅ Unified batch and streaming
- ✅ In-memory processing for speed
- ❌ High operational complexity
- ❌ Expensive for small workloads
- ❌ Requires cluster management expertise
- ❌ Higher latency than true streaming (Flink)
Snowflake
Definition: Cloud-native data warehouse with separated compute and storage architecture.
Architecture Innovation:
- Storage Layer: Centralized, automatic replication, micro-partitions
- Compute Layer: Multiple independent virtual warehouses
- Cloud Services: Metadata, optimization, security, transaction management
Key Features:
- Separation of Compute and Storage
- Zero-Copy Cloning: Instant data copies without storage duplication
- Time Travel: Query historical data (up to 90 days)
- Multi-cluster Warehouses: Auto-scaling for high concurrency
- Data Sharing: Share live data between accounts without copying
- Secure Views: Row/column-level security
When to Use:
- Business intelligence and analytics
- Data warehousing with variable workloads
- Need to scale read and write independently
- Multiple teams with different compute needs
- Cross-organization data sharing
- Ad-hoc analytics with varying concurrency
Design Patterns:
- Create separate warehouses for ETL, BI, Data Science
- Use clustering keys for large tables (>1TB)
- Materialize views for repeated aggregations
- Partition by time for time-series data
- Use multi-cluster warehouses for high concurrency
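A small sketch of the first pattern (workload-specific warehouses) using the Snowflake Python connector; account, credentials, and table names are illustrative.

```python
import snowflake.connector

# Connect with a workload-specific virtual warehouse (names are illustrative)
conn = snowflake.connector.connect(
    account="my_account", user="bi_user", password="***",
    warehouse="BI_WH", database="ANALYTICS", schema="REPORTING")
cur = conn.cursor()

# Point BI queries at their own warehouse so ETL load never competes for compute
cur.execute("USE WAREHOUSE BI_WH")
cur.execute("""
    SELECT region, DATE_TRUNC('day', sold_at) AS day, SUM(amount) AS revenue
    FROM fact_sales
    GROUP BY 1, 2
    ORDER BY day DESC
    LIMIT 100
""")
for region, day, revenue in cur.fetchall():
    print(region, day, revenue)
conn.close()
```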
Trade-offs:
- ✅ Scales compute independently from storage
- ✅ No infrastructure management (fully managed)
- ✅ Excellent concurrency handling
- ✅ Pay only for compute used (per-second billing)
- ✅ Automatic optimization and tuning
- ❌ Can be expensive at scale (compute costs add up)
- ❌ Vendor lock-in
- ❌ Not ideal for real-time updates (optimized for batch/micro-batch)
- ❌ Limited support for unstructured data
4. System Design Considerations
Data Flow Architecture
Typical Modern Data Stack:
Sources (OLTP databases, APIs, event streams) → Ingestion (Kafka/Kinesis, CDC with Debezium) → Storage (data lake on S3/ADLS and/or warehouse such as Snowflake/BigQuery) → Transformation (Spark, dbt) → Consumption (BI tools, ML pipelines, APIs), orchestrated by a tool like Airflow
Key Design Questions
When designing a data analytics system, always ask these questions:
1. Data Characteristics
- Volume: How much data per day/hour/minute?
- Velocity: Real-time, near real-time, micro-batch, or batch?
- Variety: Structured, semi-structured, or unstructured?
- Veracity: What are data quality and consistency needs?
2. Access Patterns
- Read-heavy or write-heavy?
- Query latency requirements? (milliseconds vs seconds vs minutes)
- Number of concurrent users?
- Ad-hoc vs predefined queries?
- Point queries vs analytical aggregations?
3. Processing Requirements
- Transformation complexity? (simple filters vs complex joins)
- Need for joins across multiple sources?
- Stateful vs stateless processing?
- Reprocessing/backfill requirements?
- Data lineage and audit needs?
4. Scalability Needs
- Expected data growth rate?
- Geographic distribution requirements?
- Peak vs average load patterns?
- Data retention and archival policies?
Storage Selection Matrix
| Scenario | Best Choice | Reason |
|---|---|---|
| Raw logs & ML data | Data Lake (S3/ADLS) | Schema-on-read, cost-effective, supports any format |
| Business reports & BI | Data Warehouse (Snowflake/BigQuery) | Optimized for OLAP, SQL interface |
| Complex ETL pipelines | Spark / AWS Glue | Distributed compute, handles TB/PB scale |
| Unified raw + analytics | Lakehouse (Delta/Iceberg) | Combines lake economics with warehouse performance |
| Real-time analytics | Kafka + Flink + Warehouse | Low latency streaming pipeline |
| High concurrency BI | Snowflake / BigQuery | Scales compute separately, multi-cluster support |
| Operational analytics | Druid / ClickHouse | Fast aggregations, sub-second queries |
| Time-series data | InfluxDB / TimescaleDB / Druid | Optimized for time-based queries |
| Search & text analytics | Elasticsearch | Full-text search, log analytics |
| Graph relationships | Neo4j / Amazon Neptune | Graph traversal, relationship queries |
Scaling Strategies
Ingestion Scaling
Strategies:
- Message queues (Kafka, Kinesis) for buffering
- Partition by key for parallel processing
- Rate limiting at source
- Back-pressure mechanisms
- Dead letter queues for failed messages
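A brief sketch of key-based partitioning at the producer, using the kafka-python client; broker, topic, and the event shape are illustrative.

```python
import json
from kafka import KafkaProducer

# Producer that partitions by key, so all events for a given user land in the
# same partition and can be processed in order by one consumer in the group
producer = KafkaProducer(
    bootstrap_servers="broker:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

event = {"user_id": "u-123", "action": "page_view", "path": "/pricing"}
producer.send("clickstream", key=event["user_id"], value=event)
producer.flush()
```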
Storage Scaling
Strategies:
- Separate compute from storage (Snowflake model)
- Horizontal partitioning (sharding by key)
- Time-based partitioning (year/month/day)
- Tiered storage (hot/warm/cold based on access patterns)
- Compression and columnar formats
Query Performance Optimization
Techniques:
- Materialized views for repeated aggregations
- Clustering keys / Sort keys for common filters
- Data skipping with statistics (min/max, bloom filters)
- Result caching for identical queries
- Query federation across multiple sources
- Partition pruning to scan only relevant data
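A quick PySpark illustration of partition pruning against a date-partitioned Parquet table (like the one written in the Data Lake section); the path is a placeholder.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

# Table written with partitionBy("year", "month", "day"); path is a placeholder
events = spark.read.parquet("s3://my-lake/curated/events/")

# Filtering on partition columns lets the engine skip every other partition
jan_15 = events.filter((F.col("year") == 2024) &
                       (F.col("month") == 1) &
                       (F.col("day") == 15))

# The physical plan should show partition filters rather than a full scan
jan_15.explain()
```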
Concurrency Handling
Strategies:
- Multi-cluster compute (Snowflake multi-cluster warehouses)
- Read replicas for read-heavy workloads
- Connection pooling to manage connections efficiently
- Query queuing and prioritization
- Workload management (separate ETL from BI queries)
5. Modern Data Architecture Patterns
Lambda Architecture
Concept: Separate batch and speed layers for comprehensive data processing.
Layers:
- Batch Layer: Historical data processed with Spark/MapReduce on HDFS/S3
  - Complete and accurate
  - High latency (hours/days)
  - Immutable dataset
- Speed Layer: Real-time data processed with Flink/Storm from Kafka
  - Low latency (seconds/minutes)
  - Approximate results
  - Compensates for batch layer lag
- Serving Layer: Merged views from both layers (Druid, Cassandra, HBase)
  - Queries combine batch and real-time data
  - Application-facing API
Purpose: Handle both batch and real-time processing with different latency/accuracy trade-offs
Use Cases:
- Real-time dashboards with historical context
- Fraud detection with learning from history
- Recommendation systems
Challenges:
- Complexity of maintaining two separate code paths
- Data consistency between layers
- Operational overhead
Kappa Architecture
Concept: Simplified approach treating everything as a stream.
Simplified Approach:
- Single stream processing pipeline for all data
- Replayable message queue (Kafka with long retention)
- Same code handles both real-time and batch
Components:
- Stream Storage: Kafka with days/weeks of retention
- Processing: Flink, Spark Streaming, Kafka Streams
- Serving: Cassandra, Elasticsearch, Druid
Advantages:
- Single codebase (no duplicate logic)
- Easier to maintain and debug
- Flexibility to reprocess by replaying stream
When to Use:
- All data can be modeled as events/streams
- Team has strong stream processing expertise
- Want to avoid complexity of Lambda
Limitations:
- Requires replayable message queue
- Historical reprocessing can be slow
- May not suit all use cases
Medallion Architecture (Lakehouse)
Concept: Progressive data refinement through layers (Bronze → Silver → Gold).
Layer Characteristics:
| Layer | Data Quality | Schema | Purpose | Example |
|---|---|---|---|---|
| Bronze | Raw, as-is | Flexible | Ingestion, lineage | Raw JSON logs, CDC events |
| Silver | Cleaned, validated | Enforced | Standardized business data | Deduplicated users, validated transactions |
| Gold | Aggregated, enriched | Optimized | Analytics, ML features | Daily sales by region, user behavior features |
Benefits:
- Clear data lineage and quality progression
- Different SLAs per layer (Bronze: append-only, Silver: updates allowed, Gold: optimized)
- Supports both exploratory and production workloads
- Incremental quality improvement
- Easy debugging (can trace back through layers)
Implementation:
Bronze: Raw Kafka events → Delta table (append-only)
Silver: Deduplicate, validate, join dimensions → Delta table (upserts)
Gold: Aggregate by time/dimension, create features → Delta table (optimized)
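A condensed PySpark/Delta sketch of the three layers above, assuming delta-spark is configured; paths and column names are illustrative, and a production Silver step would usually be an incremental MERGE rather than a full overwrite.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion").getOrCreate()

# Bronze: append raw events exactly as ingested (schema-on-read, full lineage)
raw = spark.read.json("s3://my-lake/landing/events/")
raw.write.format("delta").mode("append").save("s3://my-lake/bronze/events")

# Silver: deduplicate and validate into a standardized table
bronze = spark.read.format("delta").load("s3://my-lake/bronze/events")
silver = (bronze
          .dropDuplicates(["event_id"])
          .filter(F.col("user_id").isNotNull()))
silver.write.format("delta").mode("overwrite").save("s3://my-lake/silver/events")

# Gold: business-level aggregates optimized for analytics
gold = (silver
        .groupBy(F.to_date("ts").alias("event_date"), "region")
        .agg(F.count("*").alias("events"),
             F.countDistinct("user_id").alias("active_users")))
gold.write.format("delta").mode("overwrite").save("s3://my-lake/gold/daily_activity")
```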
6. Interview Checklist
When designing a data analytics platform in an interview:
Step 1: Clarify Requirements (5-10 minutes)
Questions to Ask:
- What are the primary data sources? (databases, APIs, logs, IoT)
- What's the data volume? (MB/GB/TB per day)
- What's the data velocity? (real-time, batch, micro-batch)
- What are the query latency requirements? (milliseconds, seconds, minutes)
- How many concurrent users? (10s, 100s, 1000s)
- What are the retention needs? (days, months, years)
- Are there compliance/regulatory requirements? (GDPR, HIPAA)
- What's the budget? (cost constraints)
Step 2: Design Ingestion (5-10 minutes)
Decisions to Make:
- Batch vs streaming ingestion?
- Need for message queue? (Kafka/Kinesis)
- CDC for databases? (Debezium)
- Schema validation strategy?
- Error handling and dead letter queues?
- Data partitioning strategy?
Step 3: Choose Processing Approach (5 minutes)
Decisions to Make:
- ETL vs ELT approach?
- Processing engine? (Spark, Flink, dbt)
- Transformation complexity?
- Orchestration tool? (Airflow, Dagster, Prefect)
- Incremental vs full refresh?
Step 4: Select Storage (10 minutes)
Decisions to Make:
- Data lake for raw data?
- Warehouse for analytics?
- Lakehouse for unified approach?
- Partitioning strategy? (time-based, key-based)
- Data retention and archival?
- File format? (Parquet, Avro, ORC)
Step 5: Plan Consumption (5 minutes)
Decisions to Make:
- BI tool integration? (Tableau, Looker, Power BI)
- API layer needed? (REST, GraphQL)
- ML model serving? (Feature store)
- Data access controls? (RBAC, row-level security)
- Caching strategy?
Step 6: Address Non-Functional Requirements (5-10 minutes)
Considerations:
- Scalability strategy? (horizontal, vertical)
- Monitoring and alerting? (data quality, pipeline health)
- Data quality checks? (validation, anomaly detection)
- Disaster recovery plan? (backups, replication)
- Security and compliance? (encryption, access control)
- Cost optimization? (tiered storage, auto-scaling)
7. Common Interview Scenarios
Scenario 1: Real-time Analytics Dashboard
Problem Statement: "Design a system to display user activity metrics on a website in near real-time. The dashboard should show active users, page views, clicks, and conversions updated every few seconds."
Requirements Clarification:
- Traffic: 100K requests/second
- Latency: Display updates within 5-10 seconds
- Metrics: Active users (1-min window), page views, clicks, conversions
- Retention: Hot data (7 days), historical data (1 year)
- Users: 100 concurrent dashboard viewers
Design: Website events → Kafka → Flink (windowed aggregations) → Redis (hot data) + Snowflake (historical) → API layer (WebSocket/SSE) → dashboard
Key Decisions:
- Ingestion Layer:
  - Kafka for high-throughput event ingestion (handles 100K events/sec easily)
  - Partitioned by user_id for parallel processing
- Stream Processing:
  - Flink for stateful stream processing with exactly-once guarantees
  - 5-second tumbling windows for aggregations
  - Maintains state for active user tracking (1-minute window)
- Storage Strategy:
  - Redis: Hot data (last 5 minutes) for dashboard queries
  - Snowflake: Historical data for trend analysis
  - Dual-write pattern: real-time to Redis, batch to Snowflake
- API Layer:
  - WebSocket/Server-Sent Events for pushing updates to dashboard
  - Queries Redis for real-time metrics
  - Queries Snowflake for historical context
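To make the Storage Strategy and API Layer decisions concrete, a small redis-py sketch of the hot path: the stream job writes each window's metrics with a short TTL, and the WebSocket API reads them back. Host, key names, and the metric payload are illustrative.

```python
import json
import redis

# Hot-path store for the dashboard (host/keys are illustrative)
r = redis.Redis(host="redis-host", port=6379, decode_responses=True)

def write_window_metrics(window_start: str, metrics: dict) -> None:
    """Called by the stream job for each window; entries expire after 5 minutes."""
    r.setex(f"metrics:{window_start}", 300, json.dumps(metrics))

def latest_metrics(window_start: str) -> dict:
    """Called by the WebSocket API to push the freshest numbers to the dashboard."""
    payload = r.get(f"metrics:{window_start}")
    return json.loads(payload) if payload else {}

write_window_metrics("2024-01-15T12:00:05", {"active_users": 4321, "page_views": 98765})
print(latest_metrics("2024-01-15T12:00:05"))
```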
Scalability:
- Kafka partitions: 10 partitions for parallelism
- Flink parallelism: 5 task managers
- Redis cluster: 3 nodes with replication
- Auto-scaling for API layer based on WebSocket connections
Scenario 2: Data Lake for Machine Learning
Problem Statement: "Design a data lake to store and process diverse data sources for training machine learning models. Data includes user behavior logs, transaction records, product catalog, and customer support tickets."
Requirements:
- Data sources: Logs (JSON), Databases (MySQL), Files (CSV), Text (tickets)
- Volume: 10 TB/day
- ML use case: Customer churn prediction, product recommendations
- Data scientists: 50 users running ad-hoc queries
- Training frequency: Daily batch jobs
Design: Logs, MySQL extracts, CSV files, and support tickets → ingestion into an S3 data lake (Bronze raw / Silver Parquet) → Glue Catalog + Athena for exploration → Spark feature engineering → Feature Store / SageMaker training, orchestrated by Airflow
Key Decisions:
- Storage Format:
  - Bronze (Raw): Original format (JSON, CSV) for lineage
  - Silver (Curated): Parquet format for efficient analytics
  - Partitioning: By date (`/year=2024/month=01/day=15/`)
- Schema Management:
  - Glue Crawler for automatic schema discovery
  - Glue Data Catalog as centralized metadata store
  - Schema evolution support with Parquet
- Feature Engineering:
  - Spark jobs in AWS Glue for complex transformations
  - Feature Store for ML features (training + serving)
  - Separate training and inference pipelines
- Access Patterns:
  - Data Scientists: Athena for SQL exploration, Jupyter notebooks
  - ML Training: SageMaker reading from S3/Feature Store
  - Orchestration: Airflow for scheduling ETL and training jobs
- Data Governance:
  - Data lineage tracking with Apache Atlas
  - S3 lifecycle policies for cost optimization (hot → cold storage)
  - IAM roles for fine-grained access control
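A hedged PySpark sketch of the Feature Engineering decision above: a daily job that reads Silver-layer Parquet and writes churn features partitioned by snapshot date. Paths, columns, and the feature definitions are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("churn-features").getOrCreate()

# Silver-layer inputs (paths and columns are illustrative)
tx    = spark.read.parquet("s3://ml-lake/silver/transactions/")
users = spark.read.parquet("s3://ml-lake/silver/users/")

# Simple churn features: 90-day spend, order count, and days since last order
recent = tx.filter(F.col("tx_date") >= F.date_sub(F.current_date(), 90))
features = (recent.groupBy("user_id")
            .agg(F.sum("amount").alias("spend_90d"),
                 F.count("*").alias("orders_90d"),
                 F.datediff(F.current_date(), F.max("tx_date")).alias("days_since_order"))
            .join(users.select("user_id", "signup_date", "plan"), "user_id"))

# Partition by snapshot date so daily training jobs read only what they need
(features.withColumn("snapshot_date", F.current_date())
 .write.mode("overwrite")
 .partitionBy("snapshot_date")
 .parquet("s3://ml-lake/gold/churn_features/"))
```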
Cost Optimization:
- S3 Intelligent-Tiering for automatic cost optimization
- Parquet compression (Snappy) reduces storage by 70%
- Athena charges per query (pay only for scanned data)
- Spot instances for Spark/SageMaker training jobs
Scenario 3: Enterprise Data Warehouse
Problem Statement: "Design a centralized data warehouse for a retail company. The warehouse should consolidate data from multiple OLTP databases, support 500 concurrent BI users, and enable complex analytical queries for sales reporting, inventory management, and customer analytics."
Requirements:
- Data sources: 5 MySQL databases (orders, inventory, customers, products, stores)
- Update frequency: Every 15 minutes
- Query complexity: Complex joins across 10+ tables, window functions
- Users: 500 concurrent BI users
- SLA: 95% of queries under 10 seconds
- Data retention: 5 years
Design: 5 MySQL databases → Debezium CDC → Kafka → 15-minute micro-batch loads into Snowflake (separate ETL and BI warehouses) → dbt models and tests → BI tools serving 500 concurrent users
Key Decisions:
- Data Ingestion Strategy:
  - CDC with Debezium: Captures changes from MySQL databases
  - Kafka: Buffers change events, enables replay
  - 15-minute micro-batches: Balances freshness with load
  - Separate ETL warehouse from BI warehouse for workload isolation
- Data Modeling:
  - Star schema for query performance (denormalized dimensions)
  - SCD Type 2 for customer dimension (track history)
  - Clustering keys: `date` for fact_sales, `store_id` for fact_inventory
  - Materialized views: Pre-compute common aggregations
- Snowflake Architecture:
  - ETL Warehouse: X-Large, scheduled for 15-min loads
  - BI Warehouse: Large with multi-cluster (1-10) for 500 users
  - Auto-scaling: Automatically adds clusters during peak usage
  - Auto-suspend: Warehouses suspend after 5 minutes of inactivity
- Performance Optimization:
  - Clustering keys on large tables (>100 GB)
  - Materialized views for daily/weekly aggregations
  - Result caching (24 hours)
  - Search optimization service for point lookups
- Semantic Layer (dbt):
  - Centralized business logic and metric definitions
  - Data quality tests (not null, unique, referential integrity)
  - Documentation and lineage
  - Incremental models for large fact tables
- High Availability:
  - Snowflake multi-AZ deployment
  - Kafka replication factor: 3
  - Spark cluster with multiple workers
  - BI tool load balancing
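To illustrate the data quality tests listed under the dbt semantic layer, a small PySpark sketch of the same checks (not null, unique, referential integrity) expressed directly against the warehouse tables; table and column names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()

fact_sales   = spark.table("fact_sales")       # illustrative warehouse tables
dim_customer = spark.table("dim_customer")

# Not-null check: every fact row must carry a customer key
null_keys = fact_sales.filter(F.col("customer_key").isNull()).count()

# Uniqueness check: the dimension's surrogate key must not repeat
dup_keys = (dim_customer.groupBy("customer_key").count()
            .filter(F.col("count") > 1).count())

# Referential integrity: every fact customer_key must exist in the dimension
orphans = fact_sales.join(dim_customer, "customer_key", "left_anti").count()

assert null_keys == 0, f"{null_keys} fact rows with NULL customer_key"
assert dup_keys == 0,  f"{dup_keys} duplicate customer keys in dim_customer"
assert orphans == 0,   f"{orphans} fact rows reference missing customers"
```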
Query Performance:
- 95th percentile query time: 8 seconds
- Cache hit rate: 60% (repeated queries)
- Concurrent query support: 500+ users
- Auto-scaling responds to load in <1 minute